-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Multi-stage] Support is_enable_group_trim agg option #14664
[Multi-stage] Support is_enable_group_trim agg option #14664
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #14664 +/- ##
============================================
+ Coverage 61.75% 63.86% +2.11%
- Complexity 207 1607 +1400
============================================
Files 2436 2703 +267
Lines 133233 150717 +17484
Branches 20636 23290 +2654
============================================
+ Hits 82274 96257 +13983
- Misses 44911 47253 +2342
- Partials 6048 7207 +1159
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
e6d0b43
to
7c21d95
Compare
cc @bziobrowski |
List<RexNode> projects = projectRel.getProjects(); | ||
List<RelFieldCollation> collations = sortRel.getCollation().getFieldCollations(); | ||
if (collations.isEmpty()) { | ||
// Cannot enable group trim without sort key. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand this is required for within segment trimming right? Is it an actual requirement for cross segment trimming?
Anyway, I'm not sure if we should make this kind of decisions here. Couldn't we populate the fields and let the actual operator decide whether an optimization can be applied or not?
More specifically: Don't you think it may be interesting to keep the limit even if we don't have collations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having non-empty collation is required for trimming within segment and cross-segment but it could be beneficial to propagate limit even if collation is missing.
That is because when limit is present (!= Integer.MAX_VALUE) combine operator might limit the number of group by keys in indexed table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Limit can be pushed down with or without sorting. Initially I decided to not allow pushing down limit without sorting because it can cause wrong result for aggregate group-by, but realized this can prevent distinct limit to be pushed down. Since this is not enabled by default (requires hint), I'll change it to not check collation.
public class PinotAggregateExchangeNodeInsertRule { | ||
|
||
public static class SortProjectAggregate extends RelOptRule { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although the javadoc in the main class applies to all rules, it would be cool to add a javadoc for each rule to explain what they do (and maybe add an example)
} | ||
|
||
@Override | ||
public void onMatch(RelOptRuleCall call) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
Calcite we have onMatch
and matches
methods. This method seems to be doing both. I think it would be better to override matches
to do not apply the rule in all the cases that are filtered out here.
For the final query is the same and we would need to repeat some code, but the advantage of using matches is that if try to debug which rules apply (using MarkerFilter, as explained in https://youtu.be/_phzRNCWJfw?si=hH9ukXAS2Iml11nq&t=331).
I've just created #14680 to do not forget how to enable these logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some Calcite rules also rely on return in onMatch()
if the check is not trivial to reduce overhead, e.g. ProjectToSemiJoinRule
. I'm not sure if it is a good idea to rely on which method is being invoked. IMO a safer way to detect which rules are applied is to check the actual rewrite of RelNode
@@ -249,6 +249,39 @@ | |||
"\n LogicalTableScan(table=[[default, a]])", | |||
"\n" | |||
] | |||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this PR meant to enable trimming within (minSegmentTrim) and across segments (minServerTrim) ?
If so - It'd be good to mention that :
- the former is disabled by default and requires using query option (or cluster setting)
- the latter is enabled by default but probably haven't been applied so far
I reckon It'd be good to also test this hint has on a query without order by and/or limit clause.
Apart from manual hinting I think we could propagate limit and order by details to leaf plans and enable both types of trimming if:
- order by is based on group by key(s) only (not aggregates)
- there is no HAVING clause
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaf stage has both segment trim and server trim. Will add some javadoc about the behavior, and we should consider adding them also as v2 hint. Right now the behavior relies on the v1 query options.
Agree on automating it when the result is guaranteed to be accurate. This will also require some v1 engine change, so will do it separately.
@@ -321,6 +321,10 @@ | |||
"description": "aggregate with skip intermediate stage hint (via hint option is_partitioned_by_group_by_keys)", | |||
"sql": "SELECT /*+ aggOptions(is_partitioned_by_group_by_keys='true') */ {tbl1}.num, COUNT(*), SUM({tbl1}.val), SUM({tbl1}.num), COUNT(DISTINCT {tbl1}.val) FROM {tbl1} WHERE {tbl1}.val >= 0 AND {tbl1}.name != 'a' GROUP BY {tbl1}.num" | |||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a test asserting that trimming affects results ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can probably make one, but that will cause wrong (and inconsistent) result, so not sure how to test it. I've manually tested it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already added some tests in #14727.
It was tricky because in many cases trimming makes result unstable.
7c21d95
to
aaf5a05
Compare
aaf5a05
to
92a4934
Compare
Introduce
is_enable_group_trim
as an agg option to set limit and optional sort keys into aggregate node so that group trim can be applied in the leaf stage.E.g.